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Abstract. This paper presents a novel projection-based adaptive algorithm for sparse signal and sys- 
tem identification. The sequentially observed data are used to generate an equivalent sequence of closed 
, convex sets, namely hyperslabs. Each hypcrslab is the geometric equivalent of a cost criterion, that 

\ quantifies "data mismatch" . Sparsity is imposed by the introduction of appropriately designed weighted 

5h ' (.1 balls. The algorithm develops around projections onto the sequence of the generated hyperslabs as 

Oh. 

well as the weighted £i balls. The resulting scheme exhibits linear dependence, with respect to the 
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unknown system's order, on the number of multiplications/additions and an 0{L log2 L) dependence on 
sorting operations, where L is the length of the system/signal to be estimated. Numerical results are 
also given to validate the performance of the proposed method against the LASSO algorithm and two 
very recently developed adaptive sparse LMS and LS-type of adaptive algorithms, which are considered 
' to belong to the same algorithmic family. 
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1. Introduction 

Sparsity is the key characteristic of systems whose impulse response consists of only a few nonzero 
O • coefficients, while the majority of them retain values of negligible size. Similarly, any signal comprising 
^ ] a small number of nonzero samples is also characterized as being a sparse one. The exploitation of 
^ . sparsity has been attracting recently an interest of exponential growth under the Compressed Sens- 
ing (CS) framework [1-3]. In principle, CS allows the estimation of sparse signals and systems using 
fewer measurements than those previously thought to be necessary. More importantly, identifica- 
>< : tion/reconstruction is realized with efficient constrained minimization schemes. Indeed, it has been 
. shown that sparsity is favored by li constrained solutions [4,5]. 

With only a few recent exceptions, i.e., [6-10], the majority of the proposed, so far, CS techniques 
are appropriate for batch mode operation. In other words, one has to wait until a fixed and predefined 
number of measurements is available prior to application of CS processing methods, in order to recover 
the corresponding signal/system estimate. Dynamic online operation for updating and improving 
estimates, as new measurements become available is not feasible by batch processing methods. The 
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development of efficient, online adaptive CS techniques is of great importance, especially for cases where 
the signal or system under consideration is time- varying and/or if the available storage resources are 
limited. 

The basic idea in [6, 7] is to use ii regularization, i.e., to add to a standard linear or quadratic 
loss function an extra penalty term expressed by means of the well-known £i norm of the unknown 
system/signal coefficients. Such an approach has been adopted for the classical LMS [6], and for the LS 
type [7] minimization problems. The resulting recursions for the time update use the current estimate 
and the information residing in the subgradient of the cost function (due to the non-differentiability 
of the ii norm) to provide the next estimate. 

This paper evolves along a different rationale compared to [6,7], and introduces a projection-based 
algorithm for sparse system identification and sparse signal reconstruction. The kick-off point is the set 
theoretic estimation approach, e.g., [11]. Instead of a single optimum, we search for a set of points that 
are in agreement with the available information, which resides in the training data set (measurements) 
as well as in the available constraints (the ii ball, in this case). To this end, as each new set of 
measurements is received, a closed convex set is constructed, which defines the region in the solution 
space that is in "agreement" with the current measurement. In context of the current paper, the shape 
of these convex sets is chosen to be a hyperslab. The resulting problem is a convex feasibility task, with 
an infinite number of convex constraints. The fundamental tool of projections onto closed convex sets 
is used to tackle the problem, following the recent advances on adaptive projection algorithms [12-14]. 
Instead of using the information associated with the subgradient of the ii norm, the ii constraint 
is imposed on our solution via the exact projection mapping onto a weighted ii ball. The algorithm 
consists of a sequence of projections onto the generated hyperslabs as well as the weighted ii balls. The 
associated complexity is of order 0{qL) multiplications/additions and 0{L\og2L) sorting operations, 
where L is the length of the system/signal to be identified and g is a user-defined parameter, that 
controls convergence speed and it defines the number of measurements that are processed, concurrently, 
at each time instant. The resulting algorithm enjoys a clear geometric interpretation. 

The paper is organized as follows. In Section 2 the problem under consideration is described and 
in Section 3 some definitions and related background are provided. Section 4 presents the proposed 
algorithm. The derivation and discussion of the projection mapping onto the weighted £i ball are 
treated in Section 5. The adopted mechanism for weighting the £i ball is discussed in Section 6. In 
Section 7, the convergence properties of the algorithm are derived and discussed. It must be pointed 
out that this section comprises one of the main contributions of the paper, since the existing, so far, 
theory cannot cover the problem at hand and has to be extended. In Section 8, the performance 
of the proposed algorithmic scheme is evaluated for both, time-invariant and time-varying scenarios. 
Section 9 addresses issues related to the sensitivity of the methods, used in the simulations, to non-ideal 
parametrization and, finally, the conclusions are provided in Section 10. The Appendices offer a more 
detailed tour to the necessary, for the associated theory, proofs. 
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2. Problem description 

We will denote the set of all integers, non-negative integers, positive integers, and real numbers 
by Z, Z>o, Z>o, and M, respectively. Given two integers ji,j2 £ ^, such that ji < j2, let ji, 72 := 
{ji, Ji + !,•••, J2}. 

The stage of discussion will be the Euclidean space M.^, of dimension L G Z>o. Its norm will be 
denoted by ||-||. The superscript symbol (■)"^ will stand for vector transposition. The ii norm of a 
vector h = [hi, . . . , hiY' E M.'^ is defined as the quantity \\h\\i-^ := J2i=i The support of a vector 
h is defined as supp(/i) := {i E 1, L : hi ^ 0}. The iQ-noTm of h is defined as the cardinality of its 
support, i.e., ||/i||£o := #supp(/i). 

Put in general terms, the problem to solve is to estimate a vector h^, based on measurements that 
are sequentially generated by the (unknown) linear regression model: 

yn = x^K + Vn, Vn G Z>o, (1) 

where the model outputs (2/n)nez>o C M and the model input vectors (£C„,)„g2>o C comprise the 
measurements and ('yn)nez>o is the noise process. Furthermore, the unknown vector is ^-sparse, 
meaning that it has S non-zero terms only, with S being small compared to L, i.e., S := ||/i'=i.||£o ^ L. 

For a finite number of measurements A^, the previous data generation model can be written compactly 
in the following matrix-vector form, 

y = XK + v, (2) 

where the input matrix X G M^^^ has as its rows the input measurement vectors, y := [1/1,2/2, • • • , UnY 1 
and V := [fi, ^2, • . • , v^Y ■ 

Depending on the physical quantity that represents, the model in (2) suits to both sparse signal 
reconstruction and linear sparse system identification: 

(1) Sparse signal reconstruction problem: The aim is to estimate an unknown sparse signal, 
h^, based on a set of measurements (training data), that are obtained as inner products of the 
unknown signal with appropriately selected input vectors, a;„, according to (1). The elements 
of the input vectors are often selected to be independent and identically distributed (i.i.d.) 
random variables following, usually, a zero- mean normal or a Bernoulli distribution [5]. 

(2) System identification problem: The unknown sparse system with impulse response 

is probed with an input signal n G Zi>o yielding the output values y„ as the result of 
convolution of the input signal with the (unknown) impulse response of the system. In agree- 
ment to the model of (1), the measurement (input) vector, at time ra, is given by Xn '■ = 
[Xn,Xn-i, ■ ■ ■ ,Xn-L+iY- matrix- vcctor formulation, and for a finite number of measure- 

ments, the corresponding measurement matrix X is a (partial) Toeplitz one having as entries 
the elements Tij = Xi+i-j, where i E 1, N and j E 1,L. The input signal vector, x, usually 
consists of i.i.d. normally distributed samples. The study of Toeplitz matrices, with respect to 



3 



their potential to serve as CS measurement matrices, has been recently intensified, e.g., [15,16], 
partially due to their importance in sparse channel estimation applications [17]. 

A batch approach to estimating a sparse h,* based on a limited number of measurements < L, is 
provided by the Least- Absolute Shrinkage and Selection Operator (LASSO): 

K = argmin^^ \\h\\,^<s W^h - . (3) 

In this case, /i* is assumed to be stationary and the total number of measurements, A^, needs to be 
available prior to solution of the LASSO task. 

In the current study, we will assume that /i* is not only sparse but it is also allowed to be time- 
varying. This poses certain distinct differences with regard to the standard compressive sampling 
scenario. The major objective is no longer the estimate of the sparse signal or system, based on 
a limited number of measurements. The additional requirement, which is often more hard to cope 
with, is the capability of the estimator to track possible variations of the unknown signal or system. 
Moreover, this has to take place at an affordable computational complexity, as required by most real 
time applications, where online adaptive estimation is of interest. Consequently, the batch sparsity 
aware techniques developed under the CS framework, solving LASSO or one of its variants, become 
unsuitable under time-varying scenarios. The focus now becomes to develop techniques that a) exploit 
the sparsity b) exhibit fast convergence to error floors that are as close as possible to those obtained by 
their batch counterparts c) offer good tracking performance and d) have low computational demands 
in order to meet the stringent time constraints that are imposed by most real time operation scenarios. 



3. Online estimation under the sparsity constraint 

The objective of online techniques is the generation of a sequence of estimates, (^n)nGZ>o) time, 
n, evolves, which converge to a value that "best approximates", in some sense, the unknown sparse 
vector /i^,. The classical approach to this end is to adopt a loss function and then try to minimize it 
in a time recursive manner. A more recent approach is to achieve the goal via set theoretic arguments 
by exploiting the powerful tool of projections. 

3.1. Loss function minimization approach. A well-known approach to quantify the "best approx- 
imation" term is the minimization of a user-deflned loss function 

\/n G Z>o, V/i G M^ e„(/i) := Cf\h) + Tn^"^^), (4) 

where £r"'* is computed over the training (observed) data set and accounts for the data mismatch, 
between measured and desired responses, and accounts for the "size" of the solution, and in the 
current context is the term that imposes sparsity. The sequence of user-deflned parameters (7n)nez>o 
accounts for the relative contribution of df'\ C^P^ to the cost in (4). Usually, both functions Ci^\ C^P^ 
are chosen to be convex, due to the powerful tools offered by the convex analysis theory. 
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For example, the study in [6] chooses Cr (h) := — /i^a:;„p, where Cs \h) := h G M-^, 

in order to obtain the ZA-LMS algorithm. The RZA-LMS scheme is obtained in [6] when setting 
C^r\h) := Ef=i log(l + e M^, while keeping the same 4"^. In [7], the sum Least Squares with 

a forgetting factor /3 is used in place of Ci"'^ and the ii norm in C^J^\ 

3.2. Set theoretic approach. In this paper, a different path is followed. Instead of attempting to 
minimize, recursively, a cost function that is defined over the entire observations' set, our goal becomes 
to find a set of solutions that is in agreement with the available observations as well as the constraints. 
To this end, at each time instant, n, we require our estimate hn to lie within an appropriately defined 
closed convex set, which is a subset of our solutions space and it is also known as property set. Any 
point that lies within this set is said to be in agreement with the current measurement pair (a;„,?/„). 
The "shape" of the property set is dictated by a "local" loss function, which is assumed to be convex. 
In the context of the current paper, we have adopted property sets that are defined by the following 
criterion 

Sn[e] := {/i G : \h^Xr, - 2/„| <e}, ne Z>o, (5) 

for some user-defined tolerance e > 0. Such criteria have extensively been used in the context of robust 
statistics cost functions. Eq. (5) defines a hyperslab, which is indeed a closed convex set. Any point 
that lies in the hyperslab generated at time n is in agreement with the corresponding measurement at 
the specific time instance. The parameter e determines the width of the hyperslabs. Fig. 1 shows two 
hyperslabs defined at two successive instants, namely, n and n — 1. 

Having associated each measurement pair with a hyperslab, our goal, now, becomes to find a point 
in that lies in the intersection of these hyperslabs, provided that this is nonempty. We will come 
back to this point when discussing the convergence issues of our algorithm. For a recent review of this 
algorithmic family the reader may consult [18]. 

To exploit sparsity, we adopt the notion of the weighted ii norm. Given a vector iu„ G M'^ with 
positive components, i.e., Wn,i > 0, Vi G 1, L, the weighted ii ball of radius 5 > is defined as [19] 

L 

5,Jit^„,5] :={^GM^: ^ tt^n.^l/i,! < 5}. (6) 

i=l 

For more flexibility, we let the weight vector depend on the time instant n, hence the notation Wn 
has been adopted. We will see later on that such a strategy speeds up convergence and decreases 
the misadjustment error of the algorithm. The well-known unweighted ii ball is nothing but i?^Jl,(5], 
where 1 G is a vector with Is in all of its components. Note that all the points that lie inside a 
weighted ii norm form a closed convex set. 

Having defined the weighted ii ball, which is the sparsity related constraint, our task now is to 
search for a point h in M.^ that lies in the intersection of the hyperslabs as well as the weighted ii 
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balls, i.e., for some Zq G Zi>o, 



find an ^ G fl (5,[e] n B^, K, 6]) . (7) 

n>zo 

As it will become clear later on, when discussing the convergence issues of the algorithm, the existence 
of Zq in (7) allows for a finite number of property sets not to share intersection with the rest. 

4. Proposed algorithmic framework 

The solution to the problem of finding a point lying in the intersection of a number of closed convex 
sets has been developed in the context of the classical POCS theory [20-23], in the case where there is 
a finite number of sets, and its recent extension, that deals with an infinite number of sets, originally 
proposed in [12]. The basic idea is very elegant: Keep projecting, according to an appropriate rule, 
on the involved convex sets; then this sequence of projections will, finally, take you to a point in their 
intersection. Hence, for our problem, metric projection mapping operators for, both, the hyperslabs as 
well as the weighted ii balls have to be used. Projection operators for hyperslabs are already known 
and widely used, e.g., [18,24]. The metric projection mapping onto a weighted ii norm will be derived 
here, and it was presented for a first time, to the best of our knowledge, in [10]. 

Each time instant, n, a new pair of training data {xn,yn) becomes available, and a corresponding 
hyperslab is formed according to (5). This is used to update the currently available estimate 
However, in order to speed up convergence, the update mechanism can also involve previously defined 
hyperslabs; for example, the hyperslabs formed at time instants n — g + 1, n, for some q G Z>o. Then, 
in order to obtain /in+i, an iteration scheme consisting of three basic steps, is adopted: a) the current 
estimate hn is projected onto each one of the q hyperslabs, b) these projections are in turn combined 
as a weighted sum and c) the result of the previous step is subsequently projected onto the weighted 
£i ball. This is according to the concepts introduced in [12] and followed in [13,14,24]. Schematically, 
the previous procedure is illustrated in Fig. 1, for the case of g = 2. 

In detail, the algorithm is mathematically described as follows: 
Algorithm. Let q G Z>o, and define the following sliding window on the time axis, of size at most q 
(to account for the initial period where n < g — 1), in order to indicate the hyperslabs to be considered 
at each time instant: 

J'n := max{0, n — g + 1}, n, Vn G Z>o. 
For each n, define the set of weig hts ^} 

jeJn C (0, 1] such that Xljej"™ '^j"^ ~ Each cjj""^ quantifies 
the contribution of the j-th hyperslab into the weighted combination of all the hyperslabs that are 
represented indicated in J'n. 
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Figure 1. The ^l ball is shown with dotted lines. At time n, the estimate hn is available. 
For q = 2, two hyperslabs are involved in the update recursion, those associated with 
time instants n and n — 1. The new update, hn+i, results by first projecting /i„ onto 
the hyperslabs, then combining the resulting projections and finally projecting onto the 
weighted £i norm, that is defined at time n and which is drawn using the full line. 

Given an arbitrary initial point ^ the following recursion generates the sequence of estimates 
\/n e Z>o, 

hn+1 ■= PBe^lw„,S] ^^n + y^n u'f^ Psj[e]{hn) - , (8) 

where Psj[e] and PBi^[wn,s] denote the metric projection mappings onto the hyperslab, defined by the 
j-th data pair, and onto the, (currently available) weighted ii ball, respectively. As it will be shown 
in the analysis of the algorithm in Appendix B, in order to guarantee convergence, the extrapolation 
parameter //„ takes values within the interval (0,2A^„,), where is computed by 

gjg.7n'^f' H^S,M(^")-^"II' -f ,,(n)p 

Mn := { llE,...^rPs,H(^.)-^.IP' ^^^W^'^") ^ (9) 

otherwise. 

Notice that the convexity of the function implies that M.n > 1, Vn G Z>o. 

It is interesting to point out that the algorithm is compactly encoded into a single equation! Also, 
note that projection onto the q hyperslabs can take place concurrently and this can be exploited if 
computations are carried our in a parallel processing environment. Moreover, q can be left to vary 
from iteration to iteration. The dependence of the performance of the algorithm on q will be discussed 
in Section 8. 
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It turns out that the projection mappings involved in (8) and (9) have computationally simple forms 
and are given in (10) and Section 5.2. The algorithm amounts to a computational load of order 0{qL) 
multiplications/additions and 0{L\og2 L) sorting operations. The dependence on q is relaxed in a 
parallel processing environment. 

Having disclosed the algorithmic scheme for the update of our estimate at each iteration step, as 
measurements are received sequentially, there are a number of issues, yet, to be resolved. First, 
the involved projection mappings have to be explicitly provided/derived. Second, a strategy for the 
selection of the weights in the weighted ii norm need to be decided. Third, the convergence of 
the algorithm has to be established. Although the algorithm stands on the shoulders of the theory 
developed in previously published papers, e.g., [12,13,25], the developed, so far, theory is not enough to 
cover the current algorithm. Since we do not use the ii norm, but its weighted version, the projection 
mapping Pbi_^Iw„,5] in (8) is time-varying and it also depends on the obtained estimates. Convergence 
has to be proved for such a scenario and this is established in Appendix B. 

5. Projections onto Closed Convex Sets 

A subset C of will be called convex if every line segment {Xh + (1 — \)h' : A G [0, 1]}, with 
endpoints any h, h' G C, lies in C. 

Given any set C C M^, define the (metric) distance function d{-,C) : — )■ R to C as follows: 
Va; G R^, d{x,C) := inf{||a; — /|| : / G C}. If we assume now that C is closed and convex, then 
the (metric) projection onto C is defined as the mapping Pc '■ R'^ — J" C which maps to an a; G R^ the 
unique Pcix) G C such that ||a3 — Pc{x)\\ = d{x,C). 

5.1. Projecting onto a hyperslab. The metric projection operator Pg„[e] onto the hyperslab (5) 
takes the following simple analytic form [22,23]: 



V/^GR^ Ps„[eih) = h+{ 



_ '^ni if Vn 6 > /l iCn. , 

0, if \lT^Xr, - y„| < e, (10) 



5.2. Projecting onto the weighted £i ball. The following theorem computes, in a finite number 
of steps, the exact projection of a point onto a weighted £i ball. The result generalizes the projection 
mapping computed for the case of the classical unweighted li ball in [26]. In words, the projection 
mapping exploits the part of the weighted £i ball that lies in the non-negative hyperoctant of the 
associated space. This is because the projection of a point onto the weighted £i ball lies always in 
the same hyperoctant as the point itself. Hence, one may always choose to map the problem on the 
non-negative hyperoctant, work there, and then return to the original hyperoctant of the space, where 
the point lies. The part of the weighted li norm, that lies in the non-negative hyperoctant, can be 
seen as the intersection of a closed halfspace and the non- negative hyperoctant, see Fig. 2. It turns out 
that if the projection of a point on this specific halfspace has all its components positive, e.g., point Xi 
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H, 




Figure 2. This figure illustrates the geometry of the weighted £i ball i?^Jit>,(5], and 
more specifically its intersection with the non- negative hyperoctant of The reason 
for studying only the non-negative hyperoctant is justified by Lemma 1. Two points 
iCi, X2 of are taken to demonstrate the concepts introduced in the manuscript. Notice 
that P^~{xi) > 0, which implies by Lemma 2.1 that Pq^{xi) = P^-{xi). Notice also 
the case of X2 where some components of P^- {X2) obtain negative values. Such a case 
mobilizes Lemma 2.2. 

in Fig. 2, then the projection on the halfspace and the projection of the point on the weighted £1 ball 
coincide. If, however, some of the components of the projection onto the halfspace are non-positive, 
e.g., point X2 in Fig. 2, the corresponding dimensions are ignored and the projection takes place in 
the resulting lower dimensional space. It turns out that this projection coincides with the projection 
of the point on the weighted li ball. The previous procedure is formally summarized next. 

Theorem 1. Given an G MJ"\B^^ [it;„, 5], the following recursion computes, in a finite number of steps 
(at most L), the projection of h onto the ball i?^Jit>„,, 5], i.e., the (unique) vector PBi^[wn,5]{h) G M'^. 
The case oi h E B^^[wn,Si] is trivial, since PBi^[wn,5]i.^) = ^• 

(1) Form the vector [\hi\/wn,i, . . . , \hL\/wn,LY ^ 

(2) Sort the previous vector in a non-ascending order (this takes 0{Llog2L) computations), so 

that [\hr{l)\/Wn,r{l), • • • , \hr{L)\ / Wn,r{L)]'^ , with \hr(l)\ / Wn,T{l) > ■ > \hr{L)\ / Wn,r{L) , IS obtained. 

The notation r stands for the permutation, which is implicitly defined by the sorting algorithm. 
Keep in memory the inverse which moves the sorted elements back to the original positions. 



(3) Let ri := L. 

(4) Let / = 1. While / < L, do the following. 



(a) Let A* := /. 



(b) Find the maximum among those j G 1, such that 

(c) If j^, = ri then break the loop. 

(d) Otherwise, set r/+i := j*. 

(e) Increase / by 1, and go back to Step 4a. 




1 "'n,r(i)|feT(i)|-^ 
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(5) Form the vector p G W^* whose j-th component is given by := | hr(j) \ — ^^■^^')J^^^') ^ ^ Wn,T{j) ■ 

(6) Use the inverse mapping r"^, met in step 2, to insert the number pj into the T~^(j) position 
of the L-dimensional vector p, i.e., Pt-'^u) '■= Pj-, Vj G 1,ta,, and fill in the remaining L — 
positions of p with zeros. 

(7) The desired projection is PBi^[wn,5]{h) = [sgn{hi)pi, . . . ,sgn{hL)pL]'^ G M"^, where the symbol 
sgn(-) stands for the sign of a real number. 

□ 

Proof. The proof is given in Appendix A. It follows a geometric approach, instead of the Lagrange 
multipliers methodology, which was followed in [26] for the case of the unweighted ii norm. □ 

6. Weighting the ii Ball 

Motivated by the strategy adopted in [19], the sequence of weights {Wn)n^z>o designed as follows; 
let the i-th component of the vector Wn be given by 

1 



-, VzG VnGZ>o, (11) 

I '^n,i I ~r C„ 

where (e'„)n6Z>o is a sequence of small positive parameters, which are used in order to avoid division by 
zero. An illustration of the induced geometry can be seen in Fig. 1. A way to design the parameters 
(e'„)n6Z>o will be given in the next section. The corresponding algorithm will be referred to as the 
Adaptive Projection Algorithm onto Weighted ii balls (APWLl). The unweighted case, i.e., when 
Wn '■= 1, Vn G Z>o, will be also considered and is denoted as APLl. 

Remark 1. The radius 6 of the ii norm, on which we project, depends on whether the unweighted or 
the weighted version is adopted. In the unweighted ii norm case, the optimum value of the radius is 
apparently 6 := However, in the weighted case, 6 is set equal to S* = The reason for 

this is the following. 

Consider the desirable situation where our sequence of estimates (h.„)„gz>o converges to h^, i.e., 
\imn-^oohn = h^. Moreover, let e'^ > e' > 0, Vn G Z>o, where e' is a user-defined parameter. Then, 

Yli=l'^n,i\hn,i\ < Yli=l Jh~iT?^ ^ ^>0' ^'^^ 
L L 



lim sup Wni\hni\ < lim sup -- — ^ — - = lim 

„ , - ^ ' ' ' „ , - - ' \n -1-1-1=' r?.— s-nn ' 









1 hn,i 


\+e' 



n—^oo 

i=l 





hn,i 




1 hn,i 


\+e' 



i£supp(h,) ' j^supp(h») ' i6supp(h,) 



1^0 ■ 



The previous strict inequality and the definition of lim sup suggest that there exists an mi G Z 



>o 



such that Vn > mi we have Yli=i'^n,i\hn,i\ < ||h*||£o. In other words, we obtain that Vn > mi, 
hn G Bij^[wn, ||^*||£o]- Hence, a natural choice for 6 in the design of the constraint set i?£jiu„,5] is 
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At least, such a choice is justified Vn > mi, since it becomes a necessary condition for having 
{hn)n£Z>o converge to the desirable h^. □ 

7. Convergence Properties of the Algorithm 

It can be shown that, under certain assumptions, the previous algorithm produces a sequence of 
estimates {hn)n&z>o, which converges to a point located arbitrarily close to an intersection as in (7). 
The convergence of the algorithm is guaranteed even if a finite number of closed convex sets do not 
share any nonempty intersection with the rest of the convex constraints in (7). This is important, since 
it allows for a finite number of data outliers not to disturb the convergence of the algorithm. 
Assumptions. 

(1) Define Vn G Z>o, f2„ := Bi^ [wn, S] fl (f]j^j^ ; i-^-, the set f2„ is defined as the intersection 
of the weighted ii ball and the hyperslabs that are considered at time n. Assume that there 
exists a. zq & Z>o such that fl := f]n>zo ^- That is, with the exception of a finite number 
of r2„s, the rest of them have a nonempty intersection. 

(2) Choose a sufficiently small e" > 0, and let Wn e Z>o, ^ G [e", 2 - e"]. 

(3) The interior of Q is nonempty, i.e., int(f2) ^ 0. For the definition of int(-) see Fact 2 in Appendix 
B. 

(4) Assume that u := infjo;]"'' : j G J^n,n £ ^>o} > 0. In words, none of the weights, used to 
combine the projections onto the hyperslabs, will fade away as time n advances. 

□ 

Theorem 2 (Convergence analysis of the Algorithm). Under the previously adopted assumptions, the 
following properties can be established. 

(1) Every update takes us closer to the intersection Q. In other words, the convergence is monotonic, 
that is, Wn > zq, d{hn+i,fl) < d{hn,fl). 

(2) Asymptotically, the distance of the obtained estimates from the respective hyperslabs tends to 
zero. That is, lim„^oo Hiax{(i(h,„, ^^[e]) : j G J'n} = 0. 

(3) Similarly, the distance of the obtained estimates from the respective weighted ii balls tends 
asymptotically to zero. That is, \imn^ood{hn,Bi-^^[Wn,6]) = 0. 

(4) Finally, there exists an /i* G such that the sequence of estimates {hn)nez>Q converges to, 
i.e., lim„_5.oo hn = h^, and that 

K G (liminfS^^ [■^n,^] ) n ( lim inf Pi ^^[e] ) . 

\ j&Jn / 

Here, lim inf „^oo C*™ := Un>o rim>n for any sequence (C„)„ez>o C M^, and the overline 
denotes the closure of a set. In other words, the algorithm converges to a point that lies 
arbitrarily close to an intersection of all the involved property sets. 

□ 
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Figure 3. Sparse system identification example with L = 100 and 5 = 5. (a) and 
(b) shows the performance comparison of the proposed techniques with the LMS-based 
methods for high and low noise respectively, (c) shows the effect of different q values in 
comparison to the LASSO performance. 

Proof. The proof of these results, several auxiliary concepts, as well as details on which assumptions 
are activated, in order to prove each result, can be found in Appendix B. □ 

Remark 2. Regarding Assumption 3, the condition int n^g^^^ -B^Jm^, 5] 7^ can be easily satisfied. 
To see this, choose arbitrarily a sufficiently small e' > 0, and let in (11): > e', Vn G Z>o. Then, notice 
by Fig. 2 that Vn G Z>o, Vz E 1, L, = 6{\hn,i \ > 5e'^ > 5e' . This clearly implies that Vn G Z> 
B(,^[l,d€'] C Bi^[Wn,6], \/n G Z>o. It is easy now to verify that 5(0, ^) := {heR^ : \\h\ 
[1, 6e'] C nn6Z>o i'^ri, S], which implies, of course, that G int rinez>o I'^n, S] ^ 0. 



< 



0, 
□ 



8. Performance evaluation 

In this section, the performance of the proposed algorithms is evaluated against both time-invariant 
and time- varying signals and systems. It is also compared to a number of other online algorithms such 
as the Zero- Attracting LMS (ZA-LMS) [6], the Reweighted ZA-LMS (RZA-LMS) [6], and the Recursive 
LASSO (RLASSO) [7]. Moreover, the LASSO performance, when solved with batch methods [27,28] 
is also given, since it serves as a benchmark for the best achievable performance with £i-regularized LS 
solvers. All the performance curves are the result from ensemble averaging of 100 independent runs. 
Moreover, for all the projection based algorithms tested, in all simulation examples, /i„ was set equal 
to and the hyperslabs parameter e was set equal to 1.3 x a, with a being the noise standard 

deviation. Even though such a choice may not be necessarily optimal, the proposed algorithms turn 
out to be relatively insensitive to the values of these parameters. Finally, wj"'' of (8) are set equal to 
1/9, Vj G Jn, Vn G Z>o. 

8.1. Time-invariant case. In this simulation example, a time-invariant system having L = 100 
coefficients is used. The system is sparse with 5 = 5, i.e., it has only five nonzero coefficients, which 
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are placed in arbitrary positions. The input signal x is formed with entries drawn from a zero-mean 
normal distribution with variance 1. 

In Figs. 3(a) and 3(b) the performance of the new algorithm is compared with that obtained by 
the LMS-based methods, in different noise levels. The noise variance was set equal to two different 
values, i.e., cr^ = 0.1 and = 0.001 corresponding to SNR values of approximately — 3dB and 
17dB, respectively. Two different values of the parameter q have been considered, namely 5 an 25. 
Moreover, with respect to the ZA-LMS and the RZA-LMS, the "optimized" tag indicates that the free 
parameters /i and p were optimized, in order to give the best performance at the 450th iteration. A 
different parameter setup could lead to faster convergence of both LMS-based methods, albeit at the 
expense of higher error- floors. In Fig. 3(a) we observe that APWLl exhibits the best performance 
both with respect to convergence speed as well as steady-state error floor. In fact, the larger the value 
of q is the faster the convergence becomes. However, when the unweighted £i ball is used (APLl), the 
method assumes relatively high error-floors, worse than both the LMS-based methods. 

In all the cases, unless the contrary is explicitly stated, the adopted values for 5 were: 5 := 
and 5 := S for the APLl and the APWLl respectively. The sensitivity of these methods, on using 
different values of 5, will be discussed in section 9.1. Moreover, the adaptation strategy of in (11) 
was decided upon the following observation. A very small e^, in the range of [0.001, 0.01], leads to low 
error-floors but the convergence speed is compromised. On the other hand, when is relatively large, 
e.g., e'„ > 0.1, then fast convergence speed is favored at the expense of a higher steady state error 
floor. In order to tackle this issue efficiently, e'„ can start with a high value and then getting gradually 
smaller. Although other scenarios may be possible, in all the time invariant examples, we have chosen: 
e'^:= e' + Vn G Z>o, where e' is a user-defined small positive constant. 

Fig. 3(b) corresponds to a low noise level, where the improved performance of the proposed algorithm, 
compared to that of LMS-based algorithms, is even more enhanced. 

It suffices to say, that this enhanced performance is achieved at the expense of higher complexity. 
The LMS-based algorithms require 0{L) multiply/add operations, while the APWLl demands q times 
more multiply/add operations. However, in a parallel processing environment, the dependence on q 
can be relaxed. 

In Fig. 3(c) the performance improvement of APWLl, as the q value is increasing, is examined 
and compared to performance of the RLASSO algorithm. In this system identification case, the 
L X K regression matrices, Hj, in [7] are built using the input vectors (£C„)„gz>o according to 
[x]^^T-K-K, ■ ■ ■ for r G Z>o, and K G Z>o- Parameter K was set equal to 5. As a ref- 

erence, the batch LASSO solution is also given, using the true delta value, i.e., S := The test is 

performed for two different noise levels with the solid and the dotted performance curves corresponding 
to 0"^ = 0.1 and cr^ = 0.001. Clearly, the convergence speed rapidly improves as q increases, and the 
rate of improvement is more noticeable in the range of small values of q. 
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Figure 4. Sparse signal reconstruction example with L = 2000 and S = 20, for high 
and low noise levels. 



Observe that for large values of g, the performance gets "closer" to the one obtained by the LASSO 
and RLASSO methods. Of course, the larger the q the "heavier" the method becomes from a com- 
putationally point of view. However, even for the value of g = 60 the complexity remains much lower 
than that of RLASSO. The complexity of the latter algorithm rises up to the order of O {rL"^), where 
r is the number of iterations for the cost function minimization in [7, (7)]. Indicatively, in the specific 
example r needed to be larger than L in order the method to converge for all the realizations that were 
involved. 

In the sequel, we turn our attention to the estimation of a large vector. We will realize it in the 
context of a signal reconstruction task. We assume a sparse signal vector of 2000 components with 
S* = 20 arbitrarily positioned nonzero components having values drawn from a zero-mean normal 
distribution of unit variance. In this case, the observations are obtained from inner products of the 
unknown signal with independent random measurement vectors, having values distributed according 
to zero-mean normal distribution of unit variance. The results, are depicted in Fig. 4 for cr^ = 0.1 
(SNR=— lOdB), and = 0.001 (SNR=10dB), drawn with sohd and dashed lines, respectively. It is 
readily observed that the same trend, which was discussed in the previous experiments, holds true 
for this example. It must be pointed out that in the signal reconstruction task, the input signal may 
not necessarily have the shift invariance property [29,30]. Hence, techniques that build around this 
property and have extensively been used in order to reduce complexity in the context of LS algorithms, 
are not applicable for such a problem. Both, LMS and the proposed algorithmic scheme do not utilize 
this property. 

8.2. Time- varying case. It is by now well established in the adaptive filtering community, e.g., [29], 
that convergence speed and tracking ability of an algorithm do not, necessarily, follow the same trend. 
An algorithm may have good converging properties, yet its tracking ability to time variations may not 
be good, or vice versa. There are many cases where LMS tracks better than the RLS. Although the 
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Figure 5. Time- varying sparse system identification example. The system impulse 
response changes abruptly at iteration ^501. 

theoretical analysis of the tracking performance is much more difficult, due to the non-stationarity of 
the environment, related simulated examples are always needed to demonstrate the performance of an 
adaptive algorithm in such environments. To this end, in this section, the performance of the proposed 
method to track time-varying sparse systems is investigated. Both, the number of nonzero elements of 
as well as the values of the system's coefficients are allowed to undergo sudden changes. This is a 
typical scenario used in adaptive filtering in order to study the tracking performance of an algorithm 
in practice. The system used in the experiments is 100 coefficients long. The system change is realized 
as follows: For the first 500 time instances, the first 5 coefficients are set equal to 1. Then, at time 
instance n = 501 the second and the fourth coefficients are changed to zero, and all the odd coefficients 
from 7^7 to ^^15 are set equal to 1. Note that the sparsity level, S, also changes at time instance 
n = 501, and it becomes 8 instead of 5. The results are shown in Fig. 5 with the noise variance being 
set equal to 0.1. 

The curve indicated with squares corresponds to the proposed, APWLl method with q = 15. The 
performance of the RLASSO scheme with forgetting factor /3 = 1 is denoted by circles. The latter 
clearly outperforms the rest of the methods up to time instance 500. This is expected, since LS-type 
of algorithms are known to have fast converging properties. Note that up to this time instant, the 
example coincides with that shown with solid curves in Fig. 3(c). However, the algorithm lucks the 
"agility" of fast tracking the changes that take place after convergence, due to its long memory. In 
order to make it track faster, the forgetting factor /3 has to be decreased, in order to "forget" the remote 
past. However, this affects its (initial) converging properties and in particular the corresponding error 
floor. 

When P = 0.8 (curve denoted by diamonds), the tracking speed of the RLASSO is significantly 
improved, albeit at the expense of significantly increased error floor. The significant increase in the 
error floor is also noticed in the first period, where it converges fast, yet to a steady state of increased 
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Figure 6. Sensitivity of APWLl and LASSO to the S parameter. 

misadjustment error. Adjusting the f3 parameter to lead to lower error floors, one has to sacrifice 
tracking speed. For the value of /3 = 0.9, the RLASSO (curve denoted by stars) achieves the same 
tracking speed as our proposed method, however its error floor remains notably higher. In fact, the 
steady state performance of RLASSO in this case, reaches the levels of RZA-LMS (curve denoted by 
dots). 

There are two issues related to the proposed method that have to be discussed for the time-varying 
case. The first concerns the value of 6 and the other the adaptation strategy of e^. Physical reasoning 
suggests that 6, for the weighted £i ball, should be set equal to 5 for the first 500 iterations and then 
take the value 8. However, the actual sparsity levels can not be known in advance. As a result, in the 
example of Fig. 5, 6 was fixed to 9. As it will be discussed soon, the method is rather insensitive against 
overestimated 6 values. Concerning e'^, the adaptation strategy discussed in the previous section, needs 
a slight modification. Due to the fact that the system undergoes changes, the algorithm has to be alert 
to track changes. In order to achieve this, the algorithm has the ability to monitor abrupt changes 
of the orbit {hn)nez>o- Whenever the estimated impulse response changes considerably, and such a 
change also appears in the orbit (/i„)„gz^p, in (11) is reset to e' + 1 and it is gradually reduced 
similarly to the previous example. 

9. Sensitivity of APWLl to Non ideal Parameter Setting 

The robustness of any technique is affected by its sensitivity to non "optimized" configurations. In 
this section, the sensitivity of APWLl on 6 and e is examined. The sensitivity of APWLl is compared to 
the sensitivity that LASSO and LMS-based algorithms have with respect to their associated parameters. 

9.1. Comparing to LASSO. In Fig. 6, the solid lines indicated by diamonds, crosses and circles 
correspond to the performance of the APWLl, with g = 30, when the true S parameter is overes- 
timated by 50%, 100% or underestimated by 10%, respectively. The system h,^, under consideration 
has L = 100, 5 = 5 and cr^ = 0.1. The best performance, drawn with the solid curve indicated with 
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Figure 7. Sensitivity of the LMS-based methods on the fi and p parameters compared 
to the sensitivity of APWLl to 6 and e. 

squares, is achieved when the APWLl method is supphed with the true 6 value, i.e., when 6 = S. We 
observe that the tolerance in 6 underestimation is very limited, since even an underestimation by 10% 
leads to a significant performance degradation. On the other hand, APWLl is rather insensitive to 
overestimation. Indeed, overestimation even by 100%, compared to the true value, leads to acceptable 
results. For comparison, the sensitivity of the standard LASSO is presented with dashed lines. In this 
case, the optimized 6 value equals to The sensitivity of LASSO is clearly higher, particularly to 

the steady-state region. Observe, that only a 25% deviation from the optimum value (dashed line with 
diamonds) causes enough performance degradation to bring LASSO at a higher MSE regime, compared 
APWLl. Moreover, LASSO, similarly to APWLl, exhibits limited tolerance in 6 underestimated 6 
values. 

9.2. Comparing to LMS-based techniques. Besides the 5 parameter, APWLl also needs specifi- 
cation of the hyperslabs width, i.e., the parameter e. On the other hand, LMS-based methods need the 
specification of fi and p. Fig. 7 shows the performance degradation of APWLl (curves with x-crosses), 
ZA-LMS (curves with circles) and RZA-LMS (curves with rectangles), when they are configured with 
parameter values which deviate from the "optimum" ones. The x-axis indicates deviation, from the 
"optimal" values, in percentage. The problem setting is the one shown in Fig. 3(b) and the reported 
MSE is always evaluated at time instance 450, where convergence is assumed to have been achieved. 
The 0% discrepancy point, coincides with the best achieved performance of each method. For the LMS- 
based methods, the solid and dashed curves correspond to p and p, respectively. For the APWLl, the 
dashed and the solid curves correspond to e and 6, respectively. Starting with the latter parameter, as 
expected from the discussion in section 9.1, even a slight underestimation, i.e., negative deviation from 
the optimum, leads to a sudden performance degradation. On the positive side, the method exhibits a 
very low sensitivity. With respect to e, the sensitivity of APWLl is similar to the sensitivity exhibited 
by the LMS-based methods on the p parameter. However, LMS-based methods show an increased 
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sensitivity on the /i parameter for both negative and positive deviation. In addition, the optimum n 
value depends on the length of h^, as it is the case with the standard non-regularized LMS [30]. 

10. Conclusions 

A novel efficient algorithm, of linear complexity, for sparse system adaptive identification was pre- 
sented, based on set theoretic estimation arguments. Sparsity was exploited by the introduction of a 
sequence of weighted ii balls. The algorithm consists of a sequence of projections on hyperslabs, that 
measure data mismatch with respect to the training data, and on weighted ii balls. The projection 
mapping on a weighted ii ball has been derived and a full convergence proof of the algorithm has been 
established. A comparative performance analysis, using simulated data, was performed against the 
recently developed online sparse LMS and sparse LS-type of algorithms. 

Appendix A. The metric projection mapping onto the weighted ii ball Be^[w,5] 

The results in this section are stated for any Euclidean space M}, where / G 1,L. Moreover, given 
two vectors x := [xi, . . . , Xi]'^, y := [yi, . . . , yi]^ G M', then the notation x < {<)y means that Wi G 1, /, 
Xi < {<)yi- 

A well-known property of the metric projection mapping Pc onto a closed convex set C, which will 
be used in the sequel, is the following [22,23]: 

Vx G M', V/ G C, \\x - Pcix)f < \\x - ff - \\Pcix) - ff. (12) 

Define Qi := Bij^[w, 6] fl ]R>q, where M>o stands for the non-negative hyperoctant of M! (see Fig. 2). 
Define also the following closed halfspace: Hf := {tt G : Yl\=i'^i'^i ~ w^"^ ^ Clearly the 
boundary of is the hyperplane: Hi := {u ElS} : Yl\=i '^i'^i = w^u = 6}. It is easy to verify that 
Qi = Hf n M>o. Clearly, the boundary of Qi is Hi n R^q. 

Lemma 1. (1) For any x G M.\ the projection Pbi^[w,5]{x) belongs to the same hyperoctant as x 
does, i.e., if x^ := Pb,^[w,5]{x), then sgn(x*,i) = sgn(xi), Vi G 1,/. 
(2) Define the mapping abs : x = [xi, . . . , xi]'^ (-)■ [\xi\, . . . , {xiW^, 'ix G M'. It can be easily verified 
that abs is an one-to-one mapping of any hyperoctant of onto M>o, i-e., it is a bijection. Fix 
arbitrarily an a; G M'. Consider the mapping abs which bijectively maps the hyperoctant, in 
which X is located, to M>q. Then, Pbi^[w,s\{x) = ahs~^ ^Pe^^[i„_5](abs(cc)) j , where abs""*^ stands 
for the inverse mapping of abs. In other words, in order to calculate the projection mapping 
onto -B^Jiu, S], it is sufficient to study only the case of M>o- 

□ 

Proof. (1) Without any loss of generality, assume that x belongs to the non- negative hyperoctant 
of M'. We will show that also every component of a;* is, also, non- negative. In order to derive 
a contradiction, assume that there exist some negative components of x^. To make the proof 
short, and with no loss of generality, assume that the only negative component of x^, is 
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Define the vector n^, such that := and u^:,i := V? G 2, /. Since a;^, G [w, 5], we have 

that Yl\=i'^i\^*,i\ ^ which easily leads to Yl\=i'^iW*,i\ = Yl\=2'^i\^*A ^ Yl''i=i'^i\^*,i\ < 
i.e., G Moreover, notice that since < = then Xi — > Xi — n^ i = 

xi > 0. Hence, \\x — < (xi — + Yli=2(^i " = 11^ ~ This contradicts the 

fact that = Pbi,^[w,5]{x), and establishes Lemma 1.1. 
(2) Fix arbitrarily an aj G M}. As we have seen before, Pbi^[w.s]{x) will be located in the same 
hyperoctant as x. Let any u G Si, where Si stands for the intersection of Bi-^[w,6] with the 
same hyperoctant where x belongs to. As a result, we have that sgn(a;j) = sgn(uj), Vz G 1,/, 
and 

I I 
\\x - = ^{xi - Uif = ^(sgn(xi)|xi| - sgn{ui)\ui\f 
1=1 i=i 
I I 
= ^{sgn{xi)\xi\ - sgn{xi)\ui\f = ^{\xi\ - \ui\)'^ 

i=l i=l 

= ||abs(cc) - abs(n)|p. 

Notice here that abs is a bijection from Si to Qi, so that the previous equality results into the 
following: 

||abs(a3) — [w.s]{s.hs<{x))\\ = min ||abs(a;) — u'\\ = min ||abs(a;) — abs('u)|| 

^ u'eQi ueSi 



= min \\x - u\\ = \\x - Pb. [■w,5]{,x)\\ = ||abs(a;) - abs (Pb. [w.5]{x) 
Therefore, by the uniqueness of the projection, ahs (^PB(_^iw,S]{x)j = Pb^j^^ ,5](abs(£c)), and 



Lemma 1.2 is established. 



Lemma 2. Let an a; G ]R>o \ Qi, and 



□ 



^ , . maxjO, x^w — 6} . . 

x^ := Prr- (x) = X — w. (13) 

(1) Assume that x^ > 0. Then, Pq^{x) = P^-{x). 

(2) Make the following partitions x = [|], x^ = [|*], where 1,1 E 1,1, 1 + I = I, and x,x^ G MJ, 
X, x^ G M'. Assume, now, that there exists an / G 1, / such that x^ < 0. Then, 

PQ,{xf=[PQ^{x)^, 

□ 

Proof. (1) Since x^ := P^-{x) > 0, it is clear that a;* G Hj^ fl M>o = Qi- Hence, 

min \\x — u\\ < \\x — a;*|| = lla; — Pj^-{x)\\ = min \\x — u\\ < min \\x — u\\. 
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where the last inequahty comes from Qi C Hf . Thus, ||a; — P^-(a3)|| = min^tgQjIa:; — 
Hence, by the uniqueness of the projection, Pq^{x) = P^-{x), and Lemma 2.1 is estabhshed. 
(2) Since Hi is a hyperplane, G Hi, {u — x^:)'^{x — x^) = 0, which imphes, of course, that 



\/ue HiD Mln, itJ' - x^Vix - x^) = 0. Thus, \/ue HiH ™' 



>0) 



u — a; 



I - _ - ||2 I II ~ ~ II 2 I II ||2 



(14) 



This in turn imphes that 



Pqi{x) = argmin{||n - a3||^, : ue HiH R>q} 



a.Tgmm{\\u — x\\^i : uE 

|2 



argmm|||K, — x^,\ 



u — x 



2 . 



u w + u w 



5} 



u G K>o, u e u^>o. 



u w + u w = 5}. 



(15) 



By our initial assumption a;^, < 0. Hence, it is easy to verify that Vn G M>o, Vu G M>q \ {0}, 



u — x^ 



In — which evidently suggests that 



argmin{||n — + \\u — x*\\'ii : u G M>o, "u G ]R>o, 'u'^iu + n^i«> = 6} 



argmm|||u — x^\ 



\u - xJLr : u G 



u^w = 6,u = 0} 



(16) 



Now, since x G M>o \Qi, it is clear by the geometry of the Qi that Pq^{x) will be located on 
Hi nM>Q. Hence, by (14), (15), and (16), one can verify the following: 



Pqi{x) = argmin{ 
= argmin{ 
= argmin{ 
= argmin{ 
= argmin{ 
= argmin{ 
= argmin{ 



PQiix) 



u — x\ 



u — x^ 



u — x\ 



u — x\ 



u — x\ 



u — x\ 







This establishes Lemma 2.2. 



|2 . 



u G M>o, u G ]R>o, 'uFw + u^w = 6} 
u-x^fj: nGM>o, 



ii^w = 5,u = 0} 



u G ]R>o, u^w = 6,u = 0} 

\\u — x\\'^i : u G M>o, u^w = 5, w = 0} 

■u G ]R>o, w^iu = 5, n = 0} 

ue Hif] M>o, w = 0} 



□ 



Lemma 3. Assume an a; G Min such that Vi G 1,/ — 1, — > Moreover, let a;^, := P„-{x). 



Assume that there exists an zq £ Ij ^ such that x* j„ < 0. Then, Vz > zq, a;* j < 



□ 
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Proof. Here we consider only the case where x G M>o \ Qi, i.e., x'^w — 6 > 0. Notice by (13) that 

Xi x^w — 6 , , 

X.,. < ^ — < „ „„ ■ 17 
Wi \\wp 

Now, notice also that by the construction of x and by our initial assumption, we have that 

L, . . Xi Xj„ X w 6 

> zo, — < — < — n — fi^. 
Wi Wi^, \\w\\^ 

However, by (17), this is equivalent to a;*^j < 0, Vz > iq, which estabhshes Lemma 3. □ 

A. l. The proof of Theorem 1. Notice that Step 1 is due to Lemma 1. Step 4b refers to the attempt 
of the algorithm to locate the negative components of a vector, according to Lemma 3. Step 4c refers 
to Lemma 2.1, while Step 4d corresponds to Lemma 2.2. 

Appendix B. The proof of Theorem 2 

B. l. Preliminaries. 

Definition 1 (Subgradient and sub differential [31]). Given a convex function O : — t- M, a 
subgradient Q'{x) of at a; G is an element of M'^, which satisfies the following property: 
Q'{xY{y — x) + Q{x) < Q{y), Vy G M^. The set of all the subgradients of 6 at the point x will 
be called the subdifferential of 6 at x, and will be denoted by dQ{x). Notice that if G is (Gateaux) 
differentiable at x, then the only subgradient of 6 at is its differential. □ 

Fact 1. The subdifferential of the metric distance function d{-,C) to a closed convex set C C is 
given as follows [31]: 

' Ncix)nB[0,l], xgC, 



dd{x,C) 



d{x,C) ' 

where Nc{x) := {y eR^ : y^{f -x) < 0,V/ G C}, and S[0, 1] := {y G : \\y\\ < 1}. Notice that 
Va; G M^, ||(i'(a;, C)|| < 1, where d'{x,C) stands for any subgradient in dd{x,C). □ 

We will give, now, an equivalent description of the Algorithm in (8), which will help us in proving 
several properties of the algorithm. 

Lemma 4 (Equivalent description of the Algorithm in (8)). Define the following non- negative func- 
tions: 

fv.^^ !^i^i^^^^^d(a;,5,-[e]), ifX„7^0, 
[O, ifX„ = 0, 

where X„ := {j E Jn '■ hn ^ Sj[e]}, and L„ := X]j6j-„ ^j^'^d{hn, Sj[e]). Then, (8) can be equivalently 
written as follows: 
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where A„ G (0, 2), Vn G Z>o, and 6^(/i„) is any subgradient of 6.„ at □ 

Proof. First, a few comments regarding L„ in (18) are in order. It can be easily verified by the 
definition of X„ that 3jo G J'n ■ hn ^ •S'jg [e], which is in turn equivalent to d{hn, Sj^,[e]) > 0. Hence, 
Ln > wj"^rf(^„,5'jje]) > 0, and (18) is well-defined. The reason for introducing L„ in the design is to 
give the freedom to the extrapolation parameter /i„ in (8) to be able to take values greater than or 
equal to 2; recall that /i„ G (0, 2Mn) and Mn > 1, Vn G Z>o, in (9). 

Basic calculus on sub differentials [31] and the definition of X„ suggest that 



E,ex J"''t'''^'^^ 9d{x,SAe]), ifX„^0, 
{0}, ifX„ = 0. 

Hence, in the case where X„ 7^ 0, Fact 1 implies that 

n' (h ^j'^djhn, Sj[e]) hn - PsMi^n) 

''^ ''^'^^ Ln dihn,SAe]) 

= ^Y.^f\hn-PsMi^hn)) 

= ^Y.^f^^--PsM(hn)). (20) 

Clearly, if X„ ^ 0, then Q'niKi) = ^ Y^jej^^f^ Ps^[e]{hn) = hn- Notice that the same equivalence 
holds true also in the case where X„ = 0, since in such a case /i„, G f]j^j^ Sj[e] hn = Psj[e]{hn),^j G 
J7n. In other words, we have derived the following: Vn G Z>o,0^(/in) = <(=^ Yj£j„^j^^ Psj[e]{hn) = 
hn- By this result, if we substitute (20) in (19), and if we define /i„ := A„A^„, Wn G Z>o, where 
is given in (9), then we obtain the recursion given in (8). □ 

Next are a few observations on the function 6„, which will help us to establish several convergence 
properties of the Algorithm in (8). First, notice that 

X„ = ^ ^„ G fl S,[e] ^ {ujf^hn = Ujf^Ps,[e]{hn)^ Vj G Jn) 

^hn=J2 ^f^PsM^hn) ^ e'nihn) = 0. 

In the previous relation, the symbol =^ becomes if we assume that Hjej™ ^A'^] ^ [^2, Proposition 
2.12]. Hence, if f].^j^ S,[e] ^ 0, then, X„ = ^ h„ = ^^.^^^ ujfPs^y^hn) ^ Q'nihn) = 0. Moreover, 
in the case where f]j^j^ ^jM 7^ 0; '^^^ can verify also by the definition of 6^ that 



lev<o 6„ 

where lev<o 9, := {?/ G : &n{y) < 0}. 



^^ x„ = 0, 
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Additionally, in the case where fljej'n'^if^] 7^ then we can establish the following equivalency: 
hn G lev<o On ^ Xn = 0- This can be proved as follows. For the "<^=" direction, we have that 
I„ = -v^ ^„ G fljej'n'^iH C = lev<o 6„. As for the direction, assume for a contradiction 

that I„ 7^ 0. Then, by the preceding discussion, we have hn G fljeXn "^iH' which is an absurd result if 
we recall the definition of X„. Thus, necessarily, X„ = 0, and the claim is proved. In other words, in 
the case where fljej'n "^iH then, X„ = <(=^ 0^(h,„) = 0, and thus 

hn G lev<o e„ ^ K{hn) = 0. (21) 

Definition 2 (Subgradient projection mapping [32]). Given a convex function G : MJ" — )■ R, such that 
lev<o 7^ 0, define the subgradient projection mapping Tq : — )■ uizi/i respect to as follows: 

{x- j^^Q'{x), ifa;^lev<o0, 
\x, if X e lev<o 0, 

where 0'(a;) stands for an arbitrarily fixed subgradient of at x. If / stands for the identity mapping 
in M^, the mapping := I + A(Te — /), A G (0, 2), will be called the relaxed subgradient projection 
mapping. Moreover, similarly to (12), an important property of Tq^ is the following [32]: 

Wx G M^ V/ G lev<o 0, ^^W^ - T^'\^)\\' < - ff - \\T^'\^) ' /f- (22) 

□ 

Now, (12) and (22) can be combined as follows. 

Lemma 5. Let a closed convex set C C M.^, and a convex function : M.^ — )■ M such that Cnlev<o 7^ 
0. Then, 

Vx G M^ V/ G C n lev<o0, ^11^ - PcT^'\x)f < \\x - ff - \\PcTi'\x) - /f . 

□ 

Proof. This is a direct consequence of [12, Proposition 1]. □ 

Fact 2 ( [12]). Let a sequence (a;„)„g2>o C M.^, and a closed convex set C C M'^. Assume that 

3k > : V/ G C, Vn G Z>o, - < - /f - - /f . 

Assume, also, that there exists a hyperplane 11 such that the relative interior of the set C with respect 
to n is nonempty, i.e., rin C ^ ^. Then, 3a3* G : x^ = lim„^oo ^n- 

Here, given any T C M , rix C := {y G : 3p > 0, B{y,p) n T C C}. As a byproduct of this 
definition, the interior of C is defined as intC := h^L C. Hence, it becomes clear that if intC 7^ 0, 
then we can always find a hyperplane H C M.^ such that rin C 7^ 0. This fact will be used in the proof 
of Theorem 2.4. □ 
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Fact 3 ( [12]). Let C C be a nonempty closed convex set. Assume also an / G intC, i.e., 3p > 
such that B{f, p) C C. Assume, now, an a; G M"^ \ C, and a t G (0, 1) such that / + t{x — f) ^ C. 
Then, d{x,C) > p^. □ 

Lemma 6. The set of all subgradients of the collection of convex functions (6n)nez>o! defined in (18), 
is bounded, i.e., Vn G Z>o, Va; G M^, [|e;(a;)|| < 1. □ 

Proof. Fix arbitrarily an G Z>o. Here we deal only with the case X„ ^ 0, since otherwise, the function 
0„ becomes everywhere zero, and for such a function. Lemma 6 holds trivially. 

By (18), Fact 1, and some calculus on subdifferentials [31], we obtain that Va? G M^, the norm of 
any subgradient 0^(a;) satisfies the following: 

l|0U^)ll = llE r d'{x,SAe])\\ 



a;f^rf(fc„,g,[6]) „^,^ 
+ 2^ ^ ^ \\d{x,Sj[e])\\ 

jejn: xeS,[e] " 

^ ^ u^''^d{K,SAe])\\x-Ps,i.lix)\\ ^ a;f)rf(fc„.,g,[6]) 

- ^ Ln d(x,SAe\) ^ Ln 

^ cuf c;(fe^,5',[e]) ^ Lo^''^d{h„.,S,[e]) _ 

This establishes Lemma 6. □ 

B.2. The proof of Theorem 2. 

(1) Assumption 1, Definition 2, and (21) suggest that (19) can be equivalently written as follows: 
Vn > zq, hn+i = PBf>^[w„,s]Te^\hn), where T^J^^ stands for the relaxed subgradient projection 
mapping with respect to 0„. Notice here that Wn > zq, lev<o 0n = fljeXn ^ji^] ^ Cljejn ^iM- 
Thus, by Assumption 1 and Lemma 5, we have that Vn > ZQ,^f G Q, 
2-A„,,. , ■|2_2-A„ (A„ 



< \\K -fW- \\PB,^MTi':\K) -fr = WK -fW- wK+i - fw (23) 

^ \\K+i-f\\ < \\hn-fl (24) 
If we apply infj^gj^ on both sides of (24), we establish our original claim. 
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(2) The next claim is to show that under Assumption 1, the sequence {\\hn — /||)n6Z>o converges 
V/ G Q. To this end, fix arbitrarily f E Q. By (24), the sequence {\\hn — /||)n>zo is non- 
increasing, and bounded below. Hence, it is convergent. This establishes the claim. 

Next we will show that under Assumption 1, the set of all cluster points of the sequence 
iKi)nez>o is nonempty, i.e., (t{{hn)n&i.>o) ^ ^■ 

We will first show that the sequence (^n)nez>o is bounded. This can be easily verified as 
follows; fix arbitrarily an / G f2 and notice that Vn > zq, \\hn\\ < \\hn, — /|| + ||/|| < W^zq ~ + 
||/||. Define now D := m.a.x{\\hzg — f\\ + ||/||, H/iqIIi • • • ) ll^^o-ill}) which clearly implies that 
Wn G Z>o, \\hn\\ < D. Since (/in)nez>o is bounded, there exists a subsequence of (h.„)„gz>o which 
converges to an h,^, G (Bolzano- Weierstrass Theorem). Hence, G €((/i„)„gz>o) 7^ 0- This 
establishes the claim. 

Let Assumptions 1 and 2 hold true. Then, we will show that lim„_^oo ©nC^-n) = 0. First, we 
will prove that 

lim = 0. (25) 



We will show this by deriving a contradiction. To this end, assume that there exists a 5 > 



and a subsequence {nk)k&>Q 

such that VA; G Z>o, l:''fu"''\ > 6. We can always choose a 



sufficiently large ko such that VA; > /cq, rik > zq. 

Let, now, any / G ^2, and recall that Q C -B^ Jif;„j. , 5] , VA; > A;o. Then, verify that the 
following holds true Wk > ko: 

\\hnk+l - /IP = \\PBi^[-Wn^^,5] ^^nfe " Kk || q/"''^^^ "j || 2 ^^fe (^'^fe )^ ~ 



< ll/i - X Q"fc(^"-fc) Q/ /l, N _ f ||2 



where (12) was used for PBt^[wn^,s] in order to derive the previous inequality. By the definition 
of the subgradient, we have that 6^^(/i„J^(/ - h„J + 6„^(h,„J < Qn^if) = 0. If we merge 
this into (26), we obtain the following: 



0n,(V) 0n,(V) 



/II A„,(2 A„J||Q,^^^^^^||,. 

This, in turn, implies that 

VA; >ko, < {e"5r < A„,(2 - A„J^|^^^J^ < || ^ - /f - || V+i " /f • (27) 



25 



However, as we have already shown before, (||^n — f\\)n£Z>o convergent, and hence it is a 
Cauchy sequence. This imphes that hmfc_5.oo(||^nfe — fW^ — W^nk+i ~ /IP) = 0, which apparently 
contradicts (27). In other words, (25) holds true. 

Notice, now, that for all those n G Z>o such that 6^(/i„) ^ 0, we have by Lemma 6 that 

e„,w . Iie>„)l|^ < (28) 

Notice, also, here that for all those n G Z>o such that 0^(/i„) = 0, it is clear by the well-known 
fact G dQn{hn) ^ hn E argmin{6„(a;) : x G M.^} that 6„(h,„) = 0. Take lim„_^oo on both 
sides of (28), and use (25) to establish our original claim. 

Let now Assumption 1 holds true. Then we show that there exists a. D > such that 
Vn G Z>o, Ln < D. Notice, that Vn > zo, Vj G V/ G il, 

d{h^,S,[e]) = \\K-Ps^[e]iK)\\ < \\K-f\\ + \\f-Ps^[e]iK)\\ 

<2||h,„-/|| <2||h,„-/||, 

where we have used (12) and the monotonicity of the sequence {\\hn — f\\)n>zo- Then, by the 
definition of L„, 

yn > Zo, Ln = J2 ^f^diK, S,[e]) < 2 J] oof'^WK, - f\\ = 2\\K, - f\\. 

Choose, now, any D > max{2\\hzg — f\\, Lq, . . . , L^g-i} > 0, and notice that for such a D the 
claim holds true. 

Let Assumptions 1, 2, and 4 hold true. By (18), we observe that 

> V d\K,S,[e]) > max{d\K,S,[e]) : j G X}- 

Hence, if we take lim^^oo on both sides of the previous inequality, we establish Theorem 2.2. 
(3) Here we establish Theorem 2.3. Let Assumptions 1 and 2 hold true. We utilize first (12) and 
then (22) in order to obtain the following: V/ G f2, 

ii(/ - PB,,M^)iT^e:\hn))r < \\Ti':\K) - /r - \\PB,,i^.,s]4':\hn) - fr 
= \\Tt\hn)-fr-\\hn^i-fr 

< \\hn - /ir - '^^whn - Ti':\hn)f - \\hn+i - /f < ii/i„ - /ir - ii/i„+i - /f . 

Take lim„^oo on both sides of this inequality and recall that the sequence (||h.„ — /||)nez>o is 
convergent, and thus Cauchy, in order to obtain 

hm Wil - Ps,^[^^,s])iTi':\K))\\ = hm d{T^l-\K), B,,[w^, 6]) = 0. (29) 



26 



Moreover, notice that for all n> zq such that /i„ ^ lev<o O^, by (21) we obtain that 
\K - Tt\K)\\ = \\K- K + A„ ®"i^"L el(/iJ|| = \n,MP^ < 2 



Take lim„_^oo on both sides of this inequality, and recall (25) to easily verify that 

hm \\hn-T^t"\f^n)\\=0. (30) 
Notice, now, that V/ G B£^[Wn,S], the triangle inequality implies that 

||^n-/|| < \\hn-T^':Hhn)\\ + \\Ti'':\hn)-f\\. 

If we take mif(=Bi,^[wn,s] on both sides of the previous inequality, then 

\/n G Z>o, d{hn,BeAwn^6]) < \\hn - T^l"\hn)\\ + d{T^'':\hn), BeAwn, 6]). 

Take, now, lim„_>.oo on both sides of this inequality, and use (29) and (30) to establish Theo- 
rem 2.3. 

(4) Next, let Assumptions 1, 2, and 3 hold true. By (23) notice that Vra > zo, V/ G Q, 
e" 2 — A 

— \\hn - hn+lf < Y^\\hn " K+lf < ||^„ - ff - \\hn+l - ff- 

This and Fact 2 suggest that G M.^ : lim„_5.oo hn = i.e., {h^} = £((/i„)„gz>o)- 

Now, in order to establish Theorem 2.4, let Assumptions 1, 2, 3, and 4 hold true. No- 
tice that the existence of the unique cluster point is guaranteed by the previously proved 
claim. To prove Theorem 2.4, we will use contradiction. In other words, assume that ^ 
\immfn^aof]j^j„Sj[e]. This clearly implies that ^ liminf„^oo fljejn "^jH- ^^^^ 
compact notations, we define here Vn G Z>o, \l/„ := Cljej^ ^jl'A- 

Note that the set liminf„,_^oo^n. is convex. This comes from the fact that and f]m>n'^rn 
are convex, Vn G Z>o, and that Vn G Z>o, nm>„, C r\m>n+i 

Since by our initial assumption int nn>zo '^^'^ always find an / and a p > such 

that B{f, p) C n„>,o Hence, 

Vn>^o, 5(/,p)c^„. (31) 



Notice, here, that / G nn>2o ^ Un6Z>o nm>„ *m =: lim inf „^oo ^'n C liminf 
Using this, our initial assumption on /i*, and the fact that lim inf „_^oo is closed and convex. 



then we can always find a t G (0, 1) such that ft := f + t{h^ — f) ^ liminf„_>.oo^I/n- This implies, 
by the definition of liminf „^oo ^n, that 

Vn > zo, ft^ f] (32) 

m>n 

Now, since lim„_j.oo hn = h^, there exists a zi G Z>o such that Vn > 2:1, — < . 

If we set n equal to maxj^o, Zi} in (31) and (32), then we readily verify that 3no G Z>o such 

that no > max{zo,2;i}, B{f,p) C = flje^no '^'iH and ^ The result ft ^ is 
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obviously equivalent to: 3jo G Jn^, such that ft ^ '^^^[e]. Also, notice that B{f,p) C 5'j(,[e]. 
Hence, Fact 3 suggests that d{h^, Sj^^[e]) > Y*"* • 

Using the triangle inequality — /|| < — hnj\ + — f\\, V/ G 5'jo[e], we obtain 
the following: d{hn„S,,[e]) > dCK,S,M - \\K - h^.W > ^ - ^ = ^ =: 7 > 0. 
This clearly implies that ma.x{d{hno, Sj[e]) : j G J'no} > 7 > 0. Set, now, n equal to no + 1 
in (31) and (32), and verify, as we did before, that 3ni G Z>o such that ma:x{d{hn-^, Sj[e]) : 
j G i/ni} > 7 > 0. Going on this way, we can construct a sequence (/i„j.)feez>o such that 
VA; G Z>o, max{d{hnf., Sj[e]) : j G > 7 > 0. However, this contradicts Theorem 2.2. 

Since we have reached a contradiction, this means that our initial assumption is wrong, and 

that G liminfn^oo CljeJn '^'iH- 

If we follow exactly the same procedure, as we did before, for the case of the sequence of sets 
{Be^[wn,S])n(^z>o^ then we obtain also K G limm{n^ooBi^[wn, S]. 
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